Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora
Abstract
This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church's sentence alignment algorithm (1993). Our algorithm relies not only on character-length information but also on the subtitle timing information encoded in the subtitle files. Timing is highly correlated between subtitles in different versions of the same movie, since matching subtitles should be displayed at the same time. However, absolute time values cannot be used for alignment: timing is usually specified in frame numbers rather than real time, and converting frame numbers to real time is not always possible. We therefore use normalized subtitle duration instead, which results in a significant reduction in the alignment error rate.
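The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the probabilistic cost of Gale and Church is swapped for a plain absolute-difference cost, and the function names, weights, and penalties (`align`, `w_len`, `w_dur`, `skip_penalty`, `merge_penalty`) are hypothetical choices made for illustration only. The sketch shows why normalization helps: each subtitle's duration is divided by the total duration in its file, so a file timed in frames and a file timed in seconds become comparable without knowing the frame rate.

```python
from typing import List, Tuple

# A subtitle is (start, end, text); start/end may be frames or seconds.
Subtitle = Tuple[float, float, str]


def normalized(values: List[float]) -> List[float]:
    """Scale values so they sum to 1 (unit-free shares of the whole file)."""
    total = sum(values) or 1.0  # guard against an all-zero file
    return [v / total for v in values]


def align(src: List[Subtitle], tgt: List[Subtitle],
          w_len: float = 0.5, w_dur: float = 0.5,
          skip_penalty: float = 0.05, merge_penalty: float = 0.02):
    """Align two subtitle lists by dynamic programming.

    Assumed, simplified cost: weighted absolute difference of normalized
    character length and normalized duration, plus a fixed penalty for
    skips (1-0, 0-1) and merges (2-1, 1-2).
    """
    s_len = normalized([len(text) for _, _, text in src])
    t_len = normalized([len(text) for _, _, text in tgt])
    s_dur = normalized([end - start for start, end, _ in src])
    t_dur = normalized([end - start for start, end, _ in tgt])

    def cost(i: int, di: int, j: int, dj: int) -> float:
        len_gap = abs(sum(s_len[i:i + di]) - sum(t_len[j:j + dj]))
        dur_gap = abs(sum(s_dur[i:i + di]) - sum(t_dur[j:j + dj]))
        if di == 0 or dj == 0:
            penalty = skip_penalty
        elif (di, dj) != (1, 1):
            penalty = merge_penalty
        else:
            penalty = 0.0
        return w_len * len_gap + w_dur * dur_gap + penalty

    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    # Forward pass: every predecessor of (i, j) is visited before (i, j).
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                candidate = best[i][j] + cost(i, di, j, dj)
                if candidate < best[ni][nj]:
                    best[ni][nj] = candidate
                    back[ni][nj] = (di, dj)

    # Trace the cheapest path back from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Each returned pair holds the source indices and target indices grouped together; an empty list on one side marks a subtitle with no counterpart. In practice the weights and penalties would need tuning on aligned data, which this sketch does not attempt.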
Similar resources
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR erro...
Constructing Parallel Corpus from Movie Subtitles
This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a special machine translation system with intention to automatically translate the subtitles between German and Chinese. Since the common length-based algorithm for alignment shows weakness on short spoken sentences, especially on those from different langua...
TEP: Tehran English-Persian Parallel Corpus
Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multi-lingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labors required. In this paper, the construction process of Tehran English-Persian parallel corpus (TEP) using movie subtitles,...
Dual Subtitles as Parallel Corpora
In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously, and are generally aligned in the segment level, which removes the need to automatically perform this alignment. This is desirable as extracted parallel data does not contain alignment errors present in previous work that aligns different subt...
Improved Sentence Alignment for Movie Subtitles
Sentence alignment is an essential step in building a parallel corpus. In this paper a specialized approach for the alignment of movie subtitles based on time overlaps is introduced. It is used for creating an extensive multilingual parallel subtitle corpus currently containing about 21 million aligned sentence fragments in 29 languages. Our alignment approach yields significantly higher accura...